Fix test suite collection and stale expectations by ceej640 · Pull Request #23 · gently-project/gently

ceej640 · 2026-05-31T00:41:38Z

Summary

make the live campaign coordination script skip under pytest unless explicitly enabled
restore the text tool-call fallback parser expected by tests
align DiSPIM safety tests with fixed XY hardware bounds and current camera activation API
avoid scheduling mesh status fetches when discovery hooks are called without a running event loop

Related issues

Partially addresses Feature: Testing suite for hardware conditions and hardware-dependent code #18 by making the existing hardware-adjacent suite usable under pytest.
Supports the testing/evaluation roadmap in [Feature] Rigorous copilot testing framework and domain-organized tool hierarchy #4 and [Feature] Automated benchmark suite for copilot evaluation #8 by restoring a passing regression baseline.

Verification

uv run pytest -q
- 554 passed, 4 skipped

pskeshu · 2026-05-31T18:36:05Z

Hi CeeJay, and CJ's agent(s), greetings! Welcome to gently. To set the larger context: we are developing and testing interactions between augmented LLMs, microscopes, and sample environments. At the moment, our priority is shipping to users Ryan and Brie, who will be interacting with the system for their tasks on the DiSPIM. I will have a detailed discussion about this with them at some point, but from the discussions so far, it seems that:

1. The system is used for imaging embryos. The illuminating objective uses the galvos to steer a pencil into a sheet and then across the scope. As the sheet traverses the scope, the other (collecting) objective, which is perpendicular to the illuminating one, is moved up and down with the help of its piezos to match the speed of the galvo.

Calibration for this is done by Ryan and Brie. They find the embryos individually in XY by looking at an overview camera at the bottom, which has low magnification so that its field of view is large enough to cover the entire sample area. Then we raise the stage in Z (or bring the head axis, F in the 2015 A. Kumar paper, an older implementation, closer to the stage). Then they calibrate, with a procedure we have captured as calibraiton_tools.py. There is room for improvement there, and it can easily be tested in isolation for given embryo XY coordinates after the Brie/Ryan F-axis alignment, so it can be tested at scale.

There is also some urgency to automate the F-drive focus finding: while lowering the SPIM head into the sample, focus must be acquired reliably without crashing the SPIM head into the glass slide, which is what happens if we go past focus, beyond the embryo. The embryo sticks to the glass slide, which is coated with poly-lysine in a very small area; this keeps the eggs stuck in place instead of floating away until hatching.

So, once the calibration is set, at the moment by Brie, Ryan, or gently, we decide on a timelapse. This is the technical layer of gently.

Refer to the paper Kesavan, Pontus, 2026, Journal of Microscopy: the concept of the hierarchy of levels in the microscope thought model, with an illustrated example in cancer migration, with a figure. The core idea is that there is a planning layer facing the technical implementation's perspective, to look at data flow, where planning thought elements shape how a microscopist works with any entity. Therefore, I want to ask whether I can give the biologist that as their primary layer of interaction with any microscope, and specifically our microscope, the DiSPIM, looking at an organism.

For these reasons, I want to know how to shape our planning implementation and improve its existing structures. It works quite well for what exists already, and it has gotten us plans that were admired by biologists across fields at Janelia, so it should be okay to think that there is some quality to the existing system. We want to improve it and take it to its further potential, potentially including other modalities, say robotic sample preparation (maybe that's too ambitious), or other modalities that go into the intelligent act of running an experiment.

I guess that's it for now. I want to see if you can make a single PR with this new information, and we can iterate from there :)

ceej640 · 2026-06-01T00:25:02Z

Thanks, I took this as a request for a single iteration PR using the new context you provided here. I read the Kesavan/Nordenfelt smart-microscopy framework and mapped the technical/experimental/theoretical/conceptual hierarchy into Gently plan mode in #32.

That PR adds a first-class plan_context on plan items, persists it through the context stores/templates, teaches the plan-mode prompt/tools to use it, adds validation warnings for missing hierarchy/F-drive calibration-safety context, and includes a short design doc for iteration.
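
To make the description above concrete, here is a rough sketch of what a hierarchy-aware context object with validation warnings could look like. It is an illustration only, not the code in #32: the field names follow the four levels named above, while `context_warnings`, the `uses_f_drive` flag, and the warning strings are assumptions.

```python
from dataclasses import dataclass, field

HIERARCHY_LEVELS = ("technical", "experimental", "theoretical", "conceptual")


@dataclass
class PlanContext:
    # One free-text field per hierarchy level, plus explicit constraints.
    technical: str = ""
    experimental: str = ""
    theoretical: str = ""
    conceptual: str = ""
    constraints: list = field(default_factory=list)


def context_warnings(ctx, uses_f_drive=False):
    """Return warnings (not errors) for missing hierarchy or safety context."""
    warnings = [
        f"missing {level} context"
        for level in HIERARCHY_LEVELS
        if not getattr(ctx, level).strip()
    ]
    # Hypothetical rule: any item that moves the F drive must state a safety constraint.
    if uses_f_drive and not any("f-drive" in c.lower() for c in ctx.constraints):
        warnings.append("missing F-drive calibration-safety constraint")
    return warnings
```

Returning warnings instead of raising keeps a partially specified plan usable while still flagging what is missing.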

pskeshu · 2026-06-01T01:26:24Z

Thank you very much for this synthesis. Before we proceed further, two questions:

1. How long does it take to run a plan synthesis call, and how clean is the UI/UX at runtime in plan mode? This can be tested with a browser MCP while running gently without the device layer: use the browser MCP interface to feed actions into the chat/interface, make a plan by giving it an intent, and see the *quality* of the plan it produces.
2. How generalisable is plan mode outside of DiSPIM, C. elegans, and imaging itself?

Best, Kesavan

ceej640 · 2026-06-01T01:42:22Z

Thanks, these are the right questions to ask before taking the planning work further. I would separate what I can infer from the code today from what still needs an actual browser/runtime measurement.

1. Plan synthesis runtime and plan-mode UI/UX

I do not yet have a measured browser-MCP runtime number. Based on the code, plan synthesis is not one clean atomic call; it is a conversational LLM turn plus sequential tool calls such as create_campaign, create_plan_item, link_plan_items, propose_plan, and sometimes literature/strain/lab-history lookups.

My expectation before measuring:

Simple plan, no literature/database lookups: likely tens of seconds to around 1-2 minutes.
Literature-grounded, multi-phase plan with strain search/read-paper steps: likely several minutes.
Running without the device layer should remove microscope/device risk, but it does not remove LLM/API/tool latency.

The UI/UX is structurally present: /plan enters plan mode, chat drives synthesis, and the Plans tab has document/board/graph/timeline views for inspection. I would not yet call the runtime UX clean without doing exactly the browser test you suggest. The likely weak points are flow issues rather than the model objects: whether users understand that planning is happening, whether they get enough progress feedback while waiting, whether the generated plan appears clearly in the Plans tab, and whether refinement feels natural.

A useful browser test should measure:

time to first useful plan-mode response
time from intent to visible structured plan
number of turns needed before propose_plan
whether plan items include controls, decision points, dependencies, specs, references, and now plan_context
whether the Plans UI makes the result inspectable without relying on chat history
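
The timing measurements in this list could be collected with something as small as the following sketch. `MilestoneTimer` and the milestone names are hypothetical; the only library call is `time.perf_counter`, and the clock is injectable so the helper can be exercised without a live server.

```python
import time


class MilestoneTimer:
    """Record elapsed seconds from construction to the first time each milestone is hit."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._start = clock()
        self.marks = {}

    def mark(self, name):
        # Keep only the first occurrence, e.g. the first useful response.
        if name not in self.marks:
            self.marks[name] = self._clock() - self._start
        return self.marks[name]
```

In a benchmark run, one would call `timer.mark("first_response")` when the first plan-mode message arrives and `timer.mark("plan_visible")` when the structured plan appears, then report `timer.marks`.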

2. Generalisability outside DiSPIM, C. elegans, and imaging

Architecturally, plan mode is more general than DiSPIM/C. elegans/imaging. The strong general pieces are:

Campaign / phase hierarchy
typed PlanItems: imaging, bench, genetics, analysis, decision point
dependencies, status, snapshots, references
generic lab-history querying
organism/hardware injection via get_organism() and get_hardware()
the new PlanContext, which is intentionally modality-agnostic: technical, experimental, theoretical, conceptual context plus constraints and operator/sample context

But the current implementation is still biased toward the original use case:

ImagingSpec encodes embryo/timelapse/light-sheet-style assumptions.
search_strains is C. elegans/WormBase/CGC-specific.
validation still has DiSPIM-ish hardware limits and C. elegans stage logic.
prompt examples and quality expectations are strongest for C. elegans embryo imaging.

So my answer is: plan mode is conceptually generalisable, and #32 moves it in that direction, but it is not yet fully generalised as software. To support other microscopes, organisms, or robotic sample prep cleanly, Gently needs modality/organism-specific specs and validators to become plugin/capability-driven rather than hard-coded into plan-mode validation and research tools.
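
As a sketch of what "plugin/capability-driven" could mean here, validators can be registered per capability and selected by what the active hardware declares. Everything in this block is hypothetical (the registry, the `xy_stage` capability name, and the 1000 µm limit are made up for illustration) and none of it is taken from the Gently codebase.

```python
_VALIDATORS = {}  # capability name -> list of validator functions


def register_validator(capability):
    """Decorator that attaches a validator to a named capability."""
    def decorator(fn):
        _VALIDATORS.setdefault(capability, []).append(fn)
        return fn
    return decorator


def validate(item, capabilities):
    """Run only the validators for capabilities the current setup declares."""
    problems = []
    for capability in capabilities:
        for check in _VALIDATORS.get(capability, []):
            problems.extend(check(item))
    return problems


HYPOTHETICAL_XY_LIMIT_UM = 1000.0  # made-up bound, not a real DiSPIM limit


@register_validator("xy_stage")
def check_xy_bounds(item):
    problems = []
    for axis in ("x_um", "y_um"):
        value = item.get(axis)
        if value is not None and abs(value) > HYPOTHETICAL_XY_LIMIT_UM:
            problems.append(f"{axis}={value} exceeds stage limit")
    return problems
```

With this shape, a setup that does not declare `xy_stage` (for example a robotic sample-prep station) never runs the stage check, so plan-mode validation itself stays free of instrument-specific rules.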

The next useful step is probably empirical rather than another abstract planning PR: run the offline/browser plan-mode benchmark you described, record runtime and UX friction, inspect the generated plan quality, and then use those results to decide whether the next PR should target UI feedback, plan quality, or the generalisation boundary.

pskeshu · 2026-06-01T02:09:10Z

Gotcha. I would love to see you explore plan mode empirically. Let me know if you run into any issues, and whether you are able to set up gently locally.

-- Kesavan

ceej640 · 2026-06-01T04:20:45Z

I tried to run the empirical plan-mode benchmark you suggested. Short version: I got partway through local setup and found a few concrete blockers, so I could not honestly report a completed plan-synthesis quality/runtime benchmark yet.

What I did:

Started from the current PR branch and tried to launch Gently offline with disposable benchmark storage, no device layer, and no browser auto-open.
Checked whether an Anthropic key was available in the process, user, or machine environment.
Tried to bring up the web UI locally and exercise plan mode.
Because the browser-MCP Node REPL surface was not exposed in my session and Playwright was not installed locally, I drove the same /ws/agent WebSocket protocol that the browser chat uses.
Recorded a preflight run with the DiSPIM/F-drive/embryo intent we discussed.

Findings:

A true synthesis run is currently blocked in this environment because ANTHROPIC_API_KEY is not set anywhere I can access. The first actual chat synthesis turn reaches Anthropic and then fails with 401 invalid x-api-key.
Local offline startup also failed initially because gently_perception is imported at startup but is not installed here and is not declared in pyproject.toml / requirements.txt. I used a disposable local shim only to continue UI preflight; that means this was not a clean production-equivalent launch.
Once past that, the web root failed with a current Starlette/FastAPI compatibility issue: TemplateResponse is now TemplateResponse(request, name, context), while a few routes still used the older positional form. I fixed that in Shape plan mode around microscope context hierarchy #32 because without it the offline browser benchmark cannot even load the UI.
After that route fix, the SPA root rendered successfully: HTTP 200, about 47 KB HTML.
The chat WebSocket connected and /plan mode switching worked.

Measured preflight timings, not plan-synthesis timings:

WebSocket open: ~0.087 s
connected message after open: ~0.002 s
/plan command response: ~0.004 s
First synthesis turn terminated after ~0.357 s with Anthropic auth failure

So: I was able to partially set up Gently locally enough to start the server, render the web UI, connect the chat WebSocket, and enter plan mode. I was not able to complete the benchmark you asked for because the environment lacks a real Anthropic API key, lacks the gently_perception dependency, and does not expose the requested browser-MCP control surface here.

The useful outcome is that this preflight found a real UI startup blocker, and I pushed the fix to #32. The remaining work to complete your requested benchmark is:

Make gently_perception available or make it optional for offline plan-mode startup.
Provide a real ANTHROPIC_API_KEY in the environment.
Re-run the same intent through the browser UI and record actual plan quality, tool-call organization, time to first useful response, time to completed plan, and Plans-tab inspectability.
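
For the first item, one common way to make a dependency optional is to defer the failure from import time to first use. The sketch below assumes nothing about the real `gently_perception` API: `load_optional` and `describe_image` are hypothetical helpers, and `perception.describe` is a placeholder call.

```python
import importlib


def load_optional(module_name):
    """Import a module if it is available, otherwise return None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Startup no longer fails when the package is absent.
perception = load_optional("gently_perception")


def describe_image(image):
    if perception is None:
        # Fail only when a perception feature is actually requested.
        raise RuntimeError("gently_perception is not installed; perception features are unavailable")
    return perception.describe(image)  # placeholder, not the real API
```

Plan-only or offline sessions would then start cleanly, and only code paths that need perception would raise.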

I agree with your newer comment on #32 that this now turns into a design question: plan generation needs to be faster and more intentionally structured, probably by reducing tool-call chatter around campaign/phase/task creation while preserving enough structure to enrich the biologist's discovery workflow rather than turning it into a rigid form.

ceej640 · 2026-06-01T04:22:56Z

Thanks. I did try to set Gently up locally and run the empirical offline plan-mode benchmark.

Short version: I got to a useful preflight, but not a true plan-quality benchmark yet.

What I did:

Started from the PR Shape plan mode around microscope context hierarchy #32 branch and tried python launch_gently.py --offline --no-browser with disposable storage outside the repo.
The first launch failed before the UI came up because gently_perception is imported at startup but is not installed in this environment, and I could not find it declared in pyproject.toml or requirements.txt.
To continue only the UI preflight, I used a temporary local shim for gently_perception; that was not a repo change.
With that shim, the web server started but / returned HTTP 500 because the routes were using the older positional TemplateResponse call shape. The installed Starlette/FastAPI expects TemplateResponse(request, name, context).
I fixed that route compatibility issue in Shape plan mode around microscope context hierarchy #32 and pushed commit aa79f0f. After that, / returned HTTP 200.
Since browser MCP/node_repl was not exposed in this Codex session and Playwright is not installed here, I drove the same /ws/agent protocol the UI uses as a fallback.

Findings from that preflight:

WebSocket open: ~0.087s
initial connected message: ~0.002s after open
/plan command response: ~0.004s
/plan successfully switched to plan mode and returned: "Switched to plan mode. I'm now your experimental design collaborator."
The first synthesis chat turn stopped after ~0.357s with Anthropic 401 invalid x-api-key.

Why I could not complete the exact benchmark you asked for:

There is no ANTHROPIC_API_KEY in the process/user/machine environment here, so true plan synthesis cannot run.
The missing gently_perception dependency means local offline startup is not clean unless that package is installed, declared, or made optional for plan-only/offline use.
Browser MCP control was not available in this session, and Playwright is absent, so I could not complete a visible browser interaction/screenshot benchmark; I used a WebSocket fallback instead.

So the honest status is: I was able to set Gently up only partially. After the route fix in #32, the app can serve the web UI and /plan mode switching works, but I cannot report real plan synthesis latency or plan quality until the API key plus setup/tooling gaps are resolved.
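
A small preflight check along the following lines would surface the first two blockers before any synthesis turn is attempted. This is a sketch with a hypothetical function name; it only checks that the key is present and the module is importable, so an invalid key would still fail later with a 401.

```python
import importlib.util
import os


def preflight_blockers(env=None, required_modules=("gently_perception",)):
    """List reasons an offline plan-mode benchmark cannot run yet."""
    env = os.environ if env is None else env
    blockers = []
    if not env.get("ANTHROPIC_API_KEY", "").strip():
        blockers.append("ANTHROPIC_API_KEY is not set")
    for name in required_modules:
        # find_spec returns None for a top-level module that cannot be found.
        if importlib.util.find_spec(name) is None:
            blockers.append(f"module {name!r} is not importable")
    return blockers
```

Running this at launch and printing the result would turn the failures described above into a single readable report.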

This also reinforces the next design point from #32: before optimizing the plan generation UX, we need the local offline plan-mode path to be reliable, then benchmark the actual campaign -> phase -> task creation flow.

pskeshu · 2026-06-01T18:08:46Z

Great feedback! Will start an issue on this soon.

-- Kesavan

Fix test suite collection and stale expectations

cb1ac02

This was referenced May 31, 2026

Harden web control auth and upload inputs #24

Open

Harden FileStore path and YAML loading #25

Open

Preserve setpoints and embryo positions #26

Open

pskeshu self-assigned this May 31, 2026

pskeshu marked this pull request as draft May 31, 2026 18:36

pskeshu added the good first issue Good for newcomers label May 31, 2026

ceej640 mentioned this pull request Jun 1, 2026

Shape plan mode around microscope context hierarchy #32

Open

ceej640 wants to merge 1 commit into gently-project:development from ceej640:ceej/fix-test-health
